We present a framework for obtaining explicit bounds on the rate of convergence to equilibrium
of a Markov chain on a general state space, with respect to both total variation and Wasserstein 
distances. For Wasserstein bounds, our main tool is Steinsaltz's convergence theorem for locally 
contractive random dynamical systems. We describe practical methods for finding Steinsaltz's 
"drift functions" that prove local contractivity. We then use the idea of "one-shot coupling" to 
derive criteria that give bounds for total variation distances in terms of Wasserstein distances. 
Our methods are applied to two examples: a two-component Gibbs sampler for the Normal 
distribution and a random logistic dynamical system. 
1. Introduction

In many theoretical or applied problems involving positive recurrent Markov chains, it 
is important to estimate the number of iterations until the distribution of the chain is 
"close" to its equilibrium distribution. Suppose we have a Markov chain with state space 
$\mathcal{X}$, initial state x, transition probability kernel P and limiting stationary distribution $\pi$.
We would like a quantitative bound such as

$$d(P^n(x,\cdot), \pi(\cdot)) \le g(x,n),$$

where d is a metric on the set of probability measures and $g(x,n)$ is a function that can
be computed explicitly. For example, knowledge of such a function g can be valuable to 
Bayesian statisticians using Markov chain Monte Carlo (MCMC) approximations because 
it tells them how many MCMC steps will ensure a good approximation to the posterior
distribution under consideration. An excellent survey on the theory of general state space 
Markov chains and MCMC is [19]. 

An important technical point is the specification of the metric d on the set of probability 
measures. Two common choices are the total variation (TV) metric (denoted $d_{TV}$) and
the Wasserstein metric (denoted $d_W$); see Section 2 for definitions and basic properties
of these two metrics. 

There is a rich literature on Markov chain convergence in total variation distance. Many 
tools have been developed for convergence in TV, involving probabilistic methods (for 
example, coupling, strong uniform times; see [5, 13, 19] for reviews), analytic methods 
(spectral analysis, Fourier analysis, operator theory; see [5, 21]) and geometric methods 
(path bounds, isoperimetry; see [13, 21]). Much of the progress, and many of the sharpest 
results, have been for discrete state spaces [5, 13, 21], including spaces related to graphs, 
algebraic structures, or models from statistical physics. Some results extend to general 
state spaces, but some basic discrete properties and methods do not have convenient 
analogs in the general case. Continuous state spaces are of particular interest in Bayesian 
MCMC applications [10, 19], but quantitative rigorous results about realistic examples 
are scarce.

Frequently, the desirable functions g to seek are of the form $g(x,n) = C(x) r^n$, where
$C(x)$ and r can be computed explicitly. The existence of such a function for the TV
metric is called geometric ergodicity and is known to hold under fairly general conditions
(see, for example, [16, 17]). Explicit identification of such functions can be an intricate
task, however. A classical result in this context is due to Doeblin: if there exists
a probability measure $\nu$ and $0 < \varepsilon < 1$ such that $P(x,dy) \ge \varepsilon\nu(dy)$ for every x, then
$d_{TV}(P^n(x,\cdot),\pi) \le (1-\varepsilon)^n$. It is possible to get similar bounds using coupling when
Doeblin's condition holds only on a subset K, if a "drift function" to K exists. More precisely,
one needs (i) $P(x,dy) \ge \varepsilon\nu(dy)$ for all $x \in K$; (ii) a function $V \ge 1$ and a constant $a > 1$
such that $E(V(Y_{n+1}) \mid Y_n = y) \le V(y)/a$ for all $y \in K^c$. These conditions are called (i)
minorization and (ii) drift conditions [16, 20]. For practitioners who want to implement
these conditions, the challenge is to identify such a set K and a drift function V that 
lead to tractable calculations and good results. See [11] for an impressive application of 
these conditions to a Bayesian random effects model. A good survey and another realistic 
application is in [14]. 

Coupling arguments for proving TV bounds typically use two coupled versions of a 
Markov chain that coalesce relatively quickly. This is often technically easier to do in 
discrete state spaces than in state spaces with no atoms. Minorization and drift con- 
ditions offer one solution to this difficulty: coalescence is facilitated when the coupled 
chains are simultaneously in the set K. However, in many situations, it may be hard 
to force coupled chains to coalesce, but it may be easier to force them to come (and 
stay) very close to each other. Closeness of two chains in the metric of the state space 
roughly corresponds to closeness of their distributions in the Wasserstein distance. For 
this reason, the Wasserstein distance can be a tractable alternative to the total variation 
distance for problems in continuous state spaces (see, for example, [8]). Although Wasser- 
stein convergence can be weaker than TV convergence, we shall show that under certain 
conditions, bounds on the rate of Wasserstein convergence can be used to get bounds
on the rate of TV convergence (see Section 4). Thus, proving Wasserstein convergence 
is sometimes a step toward proving TV convergence. Huber [12] also uses this general 
philosophy, employing rather different methods from ours. 

A particularly successful framework for studying convergence in Wasserstein distance 
is random dynamical systems, or iterated function systems [6, 22]. An iterated function
system is a sequence of random maps of the form $\tilde F_n(x) = f_1 \circ f_2 \circ \cdots \circ f_n(x)$
or $F_n(x) = f_n \circ f_{n-1} \circ \cdots \circ f_1(x)$, where $f_1, f_2, \ldots$ are independent and identically
distributed (i.i.d.) random maps. (Two examples are described later in this section.) The
sequence $\{F_n(x) : n \ge 1\}$ is called the forward sequence and is a Markov chain. Many
examples of Markov chains can be represented as forward iterates of i.i.d. random maps.
$\{\tilde F_n(x) : n \ge 1\}$ is called the backward sequence and, under certain conditions, it converges
pointwise to a random variable, $X_\infty$, independent of the starting point x. If $X_\infty$ exists,
in which case the system is called attractive, the distribution of $X_\infty$ is also the stationary
distribution $\pi$ of the Markov chain $F_n(x)$. The rate at which $E[\rho(\tilde F_n(x), X_\infty)]$ converges
to zero is an upper bound on the rate of convergence in distribution of the Markov chain
$F_n(x)$ to $\pi$ in Wasserstein distance. Indeed, since $F_n(x)$ has distribution $P^n(x,\cdot)$ (as does
$\tilde F_n(x)$) and since $X_\infty \sim \pi$, we have

$$d_W(P^n(x,\cdot),\pi) \le E[\rho(\tilde F_n(x), X_\infty)]. \tag{1}$$
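The difference between the forward and backward sequences can be seen in a small simulation. The following is a minimal sketch under our own assumptions: the affine maps f_i(x) = 0.5 x + u_i with u_i uniform on (0, 1) are a hypothetical toy system, not one of the paper's examples, and the helper names are ours. Because every map contracts by the factor 0.5, consecutive backward iterates started from 0 differ by at most 0.5^n, so the backward sequence settles down pathwise, while the forward sequence keeps moving.

```python
import random

def make_maps(n, seed=0):
    """Draw n i.i.d. affine maps f_i(x) = 0.5*x + u_i with u_i ~ Uniform(0, 1).

    This toy system is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    shifts = [rng.random() for _ in range(n)]
    return [lambda x, u=u: 0.5 * x + u for u in shifts]

def forward(maps, x):
    """Forward iterate f_n o ... o f_1 (x): f_1 is applied first."""
    for f in maps:
        x = f(x)
    return x

def backward(maps, x):
    """Backward iterate f_1 o ... o f_n (x): f_n is applied first."""
    for f in reversed(maps):
        x = f(x)
    return x

maps = make_maps(30)
# back[k] and fwd[k] use the first k + 1 maps, started from x = 0.
back = [backward(maps[:n], 0.0) for n in range(1, 31)]
fwd = [forward(maps[:n], 0.0) for n in range(1, 31)]
```

Both lists have the same one-step marginal distributions term by term, but only `back` is a Cauchy sequence along each realization.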

One condition that guarantees attractivity is strong contractivity, that is, $E[\log \mathrm{Lip}\, f] < 0$,
where $\mathrm{Lip}\, f$ is the Lipschitz constant of the (random) function f. This condition is a
generalization of the stronger condition that there exists a constant $r \in (0,1)$ such that
$\rho(f(x),f(y)) \le r\rho(x,y)$ for all x and y, with probability 1. (Gibbs [8] used a variation of
this condition to get a bound for the Wasserstein distance of a Markov chain $X_n$ to its
stationary distribution using coupling. See also [6] for a related result.) However,
applications frequently require weaker conditions. Steinsaltz [22] proves attractivity under a
more general condition, called "local contractivity", which says that there exists a "drift
function" $\phi : \mathcal{X} \to [1,\infty)$ and a constant $r \in (0,1)$ such that

$$G_n(x) := E[D_x F_n] \le \phi(x) r^n,$$

where $D_x f := \limsup_{y \to x} \rho(f(x),f(y))/\rho(x,y)$. He proves that if local contractivity holds, then
the backward iterates $\tilde F_n = f_1 \circ \cdots \circ f_n$ satisfy

$$E[\rho(\tilde F_n(x), X_\infty)] \le C_x r^n \quad \text{for every } n \ge 1,$$

where Cx is a number that can be computed explicitly; see Section 3.1 for further discus- 
sion. Steinsaltz's use of the term "drift" is analogous to, but different from, Rosenthal's 
use (which, in turn, is closely related to Foster-Lyapunov functions; see [7] for a review 
and references). 

Like the minorization and drift conditions, the local contractivity condition requires 
preliminary work to obtain a drift function. The goal of the first part of this paper 
(Section 3) is to provide a systematic framework for doing this. 
We developed our methods using two examples. The first is a simple Gibbs sampler
chain for Bayesian estimation of the mean and variance of a Normal distribution. The 
second example is a randomized version of the classical logistic map from dynamical 
systems theory. 

The paper is organized as follows. The remainder of this section is devoted to descrip- 
tions of our two main examples. Section 2 provides definitions and basic properties of the 
Wasserstein and total variation metrics. Section 3 examines the task of finding a drift 
function that produces quantitative bounds on Wasserstein convergence. Section 3.1 re- 
views the results of Steinsaltz [22] and Section 3.2 presents an approach to finding drift 
functions by looking for sub-eigenfunctions of a certain dominating operator. Section 3.3
then uses this approach to find drift functions for our Gibbs sampler example. Section 4 
shows how bounds on the Wasserstein metric may be "upgraded" to bounds on the total 
variation metric in some situations. Section 4.1 reviews the idea of "one-shot coupling" 
[18] and presents our key technical result (Theorem 12). Sections 4.2 and 4.3 apply this 
result to our two examples. 

Example 1 (Normal Gibbs sampler). A simple Bayesian estimation problem is the 
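following. Consider a random sample of size J from the Normal distribution with mean
$\theta$ and variance $\sigma^2$ (written $N(\theta,\sigma^2)$). We assume that $\theta$ and $S := \sigma^{-2}$ are themselves
independent random variables from Normal and Gamma prior distributions respectively:

$$\theta \sim N(\xi, K^{-1}) \quad\text{and}\quad S := \sigma^{-2} \sim \Gamma(\alpha,\beta).$$

(Here, $\Gamma(\alpha,\beta)$ is the Gamma distribution with density $s^{\alpha-1}\beta^{\alpha}\exp(-\beta s)/\Gamma(\alpha)$.) Let
$Y := (Y_1, \ldots, Y_J)$ be our random sample from $N(\theta,\sigma^2)$ (conditionally independent, given $\theta$ and
$\sigma$). The joint posterior for $\theta$ and S given Y is

$$p(\theta, s \mid Y) \propto s^{\alpha-1+J/2} \exp\Big(-\beta s - \frac{s}{2}\sum (Y_j-\theta)^2 - \frac{K}{2}(\theta-\xi)^2\Big) \tag{2}$$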

(where $\sum$ is the sum over j from 1 to J). Besides positive values of K, we shall also
consider the case K = 0. When K = 0, the prior for $\theta$ is not a probability distribution;
however, the joint posterior is a probability distribution. (We can view K = 0 as the "flat
prior" limit $K \to 0+$. The case $\beta = 0$ is similar.) The Gibbs sampler is the Markov chain
$(\theta_t, S_t)$ defined recursively by drawing $\theta_t$ from its conditional distribution given Y and
$S = S_{t-1}$, followed by drawing $S_t$ from its conditional distribution given Y and $\theta = \theta_t$:



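$$\theta_t \sim N\Big(\frac{S_{t-1}J\bar Y + K\xi}{S_{t-1}J+K},\ \frac{1}{S_{t-1}J+K}\Big), \qquad S_t \sim \Gamma\Big(\alpha+\frac{J}{2},\ \beta+\frac12\sum (Y_j-\theta_t)^2\Big).$$

We can represent this procedure as follows:

$$\theta_t = \frac{S_{t-1}J\bar Y + K\xi}{S_{t-1}J+K} + \frac{Z_t}{\sqrt{S_{t-1}J+K}}, \quad\text{where } Z_t \sim N(0,1), \tag{3}$$

$$S_t = \frac{G_t}{\beta+\frac12\sum (Y_j-\theta_t)^2}, \quad\text{where } G_t \sim \Gamma(\alpha+J/2, 1) \tag{4}$$

(and $\{Z_t\}$ and $\{G_t\}$ are independent i.i.d. sequences). Let

$$\bar Y := \frac1J \sum Y_j \quad\text{and}\quad \Sigma_0 := \beta + \frac12\sum (Y_j-\bar Y)^2$$

(we treat these as constants, since we always condition on Y). Since

$$\sum (Y_j-\theta)^2 = \sum (Y_j-\bar Y)^2 + J(\bar Y-\theta)^2, \tag{5}$$

we can write equation (4) as

$$S_t = \frac{G_t}{\Sigma_0 + (J/2)(\theta_t-\bar Y)^2}. \tag{6}$$

Using equation (3), we can express (6) as a random dynamical system, as follows:

$$S_t = f_t(S_{t-1}), \quad t = 1, 2, \ldots, \tag{7}$$

where $f_t : (0,\infty) \to (0,\infty)$ is the random function

$$f_t(s) = \frac{G_t}{\Sigma_0 + (J/2)\big(Z_t/\sqrt{sJ+K} + (\xi-\bar Y)K/(sJ+K)\big)^2}, \tag{8}$$

with the random variables $G_t$ and $Z_t$ as above. The case K = 0 is of special interest
(representing an improper prior for $\theta$) and equation (8) specializes to

$$f_t(s) = \frac{G_t}{\Sigma_0 + Z_t^2/(2s)}. \tag{9}$$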

We note that the posterior (2) is a proper probability distribution when K = 0, even
though the prior is not (to see this, use (5) and integrate 9 first). 
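The random map in equation (8) is easy to simulate. The sketch below is our own illustration, not code from the paper: the function names are hypothetical, `sigma0` stands for the constant written $\Sigma_0$ (which we take to be $\beta + \frac12\sum(Y_j-\bar Y)^2$), and `ybar` stands for $\bar Y$. The draws g and z are passed in explicitly so that the map itself is deterministic and can be checked against the K = 0 special case (9).

```python
import math
import random

def gibbs_map(s, g, z, J, K, xi, ybar, sigma0):
    """One application of the random map f_t of equation (8), given draws g and z."""
    w = s * J + K
    # Deviation theta_t - Ybar implied by equation (3).
    dev = z / math.sqrt(w) + (xi - ybar) * K / w
    return g / (sigma0 + 0.5 * J * dev * dev)

def simulate(s0, n, alpha, J, K, xi, ybar, sigma0, seed=0):
    """Run n steps of the chain S_t = f_t(S_{t-1}) started from s0."""
    rng = random.Random(seed)
    s, path = s0, []
    for _ in range(n):
        g = rng.gammavariate(alpha + J / 2, 1.0)  # G_t ~ Gamma(alpha + J/2, 1)
        z = rng.gauss(0.0, 1.0)                   # Z_t ~ N(0, 1)
        s = gibbs_map(s, g, z, J, K, xi, ybar, sigma0)
        path.append(s)
    return path
```

For example, `simulate(1.0, 1000, alpha=1.0, J=5, K=1, xi=0.0, ybar=0.5, sigma0=5.0)` returns one trajectory of the precision chain; the parameter values here are arbitrary illustrations.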

Without loss of generality, we can assume that $\xi$ is zero and that K is either 0 or
1. (Indeed, if K > 0, then we can let $\tilde\theta = (\theta-\xi)\sqrt{K}$, $\tilde Y_i = (Y_i-\xi)\sqrt{K}$, $\tilde\sigma^2 = K\sigma^2$ and
$\tilde\beta = K\beta$; then $\tilde Y_i \sim N(\tilde\theta, \tilde\sigma^2)$, where $\tilde\theta \sim N(0,1)$ and $\tilde\sigma^{-2} \sim \Gamma(\alpha,\tilde\beta)$.) Accordingly, for our
Markov chain $\{S_t\}$ with $K \in \{0,1\}$, let $P_K$ be the chain's transition probability kernel,
let $p_K(\cdot,\cdot)$ be the density of $P_K$ and let $\pi_K$ be the stationary distribution.

We shall obtain quantitative bounds for the convergence of our Gibbs sampler chain $P_K$
($K \in \{0,1\}$); see Propositions 11 and 14, and the discussions of numerical results following
each. Roberts and Rosenthal [18] analyzed this chain with flat priors, that is, $K = \xi = \beta = 0$
and $\alpha = 1$. In particular, their results show that $\limsup_{n\to\infty}[d_{TV}(P_0^n(x,\cdot),\pi_0)]^{1/n} \le$
1/J. This would equal our asymptotic rate if we could replace if by 1 in Proposition 14. 
The analysis of [18] uses the property that the recursion for $1/S_t$ is a linear function of
$1/S_{t-1}$, which only holds when K = 0. Their approach cannot handle the case K > 0.
Our method of Section 4 may be viewed as a more powerful (nonlinear) generalization 
of [18]. 

Example 2 (Random logistic map). We consider the i.i.d. random maps $f_1, f_2, \ldots$
on [0, 1] defined by

$$f_i(x) = 4B_i x(1-x),$$

where $B_1, B_2, \ldots$ are i.i.d. random variables having the Beta$(a+\tfrac12, a-\tfrac12)$ distribution.
Here, $a > \tfrac12$ is a fixed number. It is known that the Beta(a, a) distribution is the unique
stationary distribution for this iterated function system [3]. Our result for this example
will provide bounds that are more qualitative than quantitative. Asymptotic convergence
properties of this example have been studied in the literature. Steinsaltz [22] showed that
the system is locally contractive if a > 2 and hence that the corresponding Markov chain
converges to equilibrium exponentially rapidly in the Wasserstein distance. Using the 
techniques of Section 4, we shall prove the following theorem. 

Theorem 1. Assume that a > 1/2 and let $x \in (0,1)$. There then exists a constant $C_a$,
depending only on a, such that

dTv{Fnix),/3a^a) < Ca[dw (Fn-lix) , (3a,aT^ ''"^'^ for all n > 1 

(where $\beta_{a,a}$ is a random variable having the Beta(a, a) distribution).

Note that Theorem 1 does not assume local contractivity (indeed, local contractivity 
fails if 1/2 < a < 1, by Corollary 3 of [23] and Theorem 1 of [22]).

Theorem 1 implies the following. Assume that the random logistic Markov chain
$\{F_n(x) : n = 0, 1, \ldots\}$ converges to its equilibrium exponentially rapidly in Wasserstein
distance, that is, that there exists a constant $\rho \in (0,1)$ such that

$$\limsup_{n\to\infty}\,[d_W(F_n(x),\beta_{a,a})]^{1/n} \le \rho. \tag{10}$$

It then also converges exponentially rapidly in TV distance, perhaps at a modestly slower 
rate: 

limsup[dTv(F„(x),^,,J]i/" < < 1. 

n— ^oo 

Since the state space (0,1) has diameter 1, we trivially have $d_W(F_n(x),\beta_{a,a}) \le
d_{TV}(F_n(x),\beta_{a,a})$. Hence, we conclude that for a > 1/2, our random logistic Markov
chain converges to the equilibrium exponentially rapidly in Wasserstein distance if and
only if it converges exponentially rapidly in TV distance.
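The stationarity claim for the random logistic map can be sanity-checked at the level of first moments, and the chain itself is simple to simulate. The sketch below is ours and rests on stated assumptions: we read the Beta parameters of $B_i$ as a + 1/2 and a - 1/2, the function names are hypothetical, and the moment check uses only the standard mean and variance of Beta distributions. It confirms that if X has the Beta(a, a) law and B is an independent Beta(a + 1/2, a - 1/2) variable, then 4BX(1 - X) again has mean 1/2; this is a necessary consistency check, not a proof of stationarity.

```python
from fractions import Fraction
import random

def one_step_mean(a):
    """E[4 B X (1 - X)] for independent B ~ Beta(a+1/2, a-1/2), X ~ Beta(a, a)."""
    a = Fraction(a)
    mean_b = (a + Fraction(1, 2)) / (2 * a)   # E[B]
    var_x = 1 / (4 * (2 * a + 1))             # Var(X) for Beta(a, a)
    # E[X(1 - X)] = E[X] - E[X^2], with E[X] = 1/2.
    mean_x_one_minus_x = Fraction(1, 2) - (var_x + Fraction(1, 4))
    return 4 * mean_b * mean_x_one_minus_x

def simulate_logistic(x0, n, a, seed=0):
    """Forward iterates of the random logistic map started from x0 (a > 1/2)."""
    rng = random.Random(seed)
    x, path = x0, []
    for _ in range(n):
        b = rng.betavariate(a + 0.5, a - 0.5)
        x = 4.0 * b * x * (1.0 - x)
        path.append(x)
    return path
```

Exact rational arithmetic is used for the moment check so that the identity holds without rounding error.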






N. Madras and D. Sezer 



2. Wasserstein and total variation metrics 

In this section, we review the definitions and some properties of two metrics on the space
of probability measures: the Wasserstein metric and the total variation (TV) metric. For
a broader review of metrics on probabilities, see [9]. 

Let $(\mathcal{X}, \rho)$ be a complete separable metric space. Consider two probability measures,
$\mu_1$ and $\mu_2$, on $\mathcal{X}$. Let Joint$(\mu_1,\mu_2)$ denote the set of all probability measures M on $\mathcal{X} \times \mathcal{X}$
whose marginal distributions are $\mu_1$ and $\mu_2$, that is,

$$\mu_1(dx) = \int_{y \in \mathcal{X}} M(dx,dy) \quad\text{and}\quad \mu_2(dy) = \int_{x \in \mathcal{X}} M(dx,dy).$$

In other words, if two random variables $X_1$ and $X_2$ have distributions $\mu_1$ and $\mu_2$, respectively,
then Joint$(\mu_1,\mu_2)$ is the set of all "couplings" of $X_1$ and $X_2$.

The Wasserstein distance between $\mu_1$ and $\mu_2$, denoted $d_W(\mu_1,\mu_2)$, is defined to be

$$d_W(\mu_1,\mu_2) = \inf\Big\{\int \rho(x,y)\,M(dx,dy) : M \in \mathrm{Joint}(\mu_1,\mu_2)\Big\}. \tag{11}$$

In other words, $d_W(\mu_1,\mu_2)$ is the infimum of $E(\rho(X_1,X_2))$ over all couplings of $X_1$ and
$X_2$ (where $X_i \sim \mu_i$). It can be shown that there exists an M that attains the infimum
(see, for example, Section 5.1 of [4]).

The total variation (TV) distance between $\mu_1$ and $\mu_2$, denoted $d_{TV}(\mu_1,\mu_2)$, is defined
to be

$$d_{TV}(\mu_1,\mu_2) = \sup\{|\mu_1(A)-\mu_2(A)| : A \subset \mathcal{X}\}. \tag{12}$$

This sup is attained by some set A (by the classical Hahn decomposition for the signed
measure $\mu_1 - \mu_2$). An equivalent definition of $d_{TV}$ is

$$d_{TV}(\mu_1,\mu_2) = \inf\{M(\{(x,y) : x \ne y\}) : M \in \mathrm{Joint}(\mu_1,\mu_2)\}. \tag{13}$$

In other words, $d_{TV}(\mu_1,\mu_2)$ is the infimum of $\Pr\{X_1 \ne X_2\}$ over all couplings of $X_1$
and $X_2$ (where $X_i \sim \mu_i$). For convenience, we shall sometimes talk about the Wasserstein
or TV distance between two random variables, which means the same thing as the
Wasserstein or TV distance between their distributions.

The following is relatively well known (see, for example, Theorem 5.7 of [4] or
Proposition 3 of [19]).

Proposition 2. Assume that $\mu_1$ and $\mu_2$ are probability measures on $\mathcal{X}$, having density
functions $p_1$ and $p_2$, respectively, with respect to a common reference measure $\lambda$. Then

$$d_{TV}(\mu_1,\mu_2) = \frac12\int |p_1(z)-p_2(z)|\,\lambda(dz) \tag{14}$$

$$= \int_{z\,:\,p_1(z)>p_2(z)} (p_1(z)-p_2(z))\,\lambda(dz) \tag{15}$$



Wasserstein and TV convergence of Markov chains 






$$= 1 - \int \min\{p_1(z),p_2(z)\}\,\lambda(dz). \tag{16}$$

If the state space $\mathcal{X}$ is bounded, then $d_W(\mu_1,\mu_2) \le d_{TV}(\mu_1,\mu_2) \times [\sup\{\rho(x,y) : x,y \in
\mathcal{X}\}]$ and, in particular, TV convergence implies Wasserstein convergence. However, in
general, neither convergence implies the other. For example, in $\mathbb{R}$, let $\mu_n$ be the two-point
probability distribution that has $\mu_n(\{0\}) = 1 - 1/n$ and $\mu_n(\{n\}) = 1/n$. Then $\mu_n$
converges to the point mass at 0 in the TV metric, but not in Wasserstein. Also, let $\nu_n$
be the probability distribution on [0, 1] with density $1 + \sin(2\pi n x)$; then $\nu_n$ converges to
the uniform distribution on [0, 1] in Wasserstein, but not in TV.
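The first counterexample can be checked by direct computation for finitely supported distributions on the real line. The sketch below is our own illustration with hypothetical function names. It uses formula (14) for the TV distance and, for the Wasserstein distance, the standard one-dimensional identity that $d_W$ equals the integral of the absolute difference of the two distribution functions; that identity is a well-known fact about the real line and is an assumption here in the sense that it is not stated in the text. The two-point law is taken with masses 1 - 1/n at 0 and 1/n at n.

```python
from fractions import Fraction

def tv_distance(p, q):
    """Total variation distance between two pmfs given as {point: mass} dicts."""
    support = set(p) | set(q)
    return sum(abs(p.get(x, 0) - q.get(x, 0)) for x in support) / 2

def wasserstein_1d(p, q):
    """Wasserstein distance on the real line via the integral of |CDF_p - CDF_q|."""
    points = sorted(set(p) | set(q))
    total, cdf_gap = 0, 0
    for left, right in zip(points, points[1:]):
        cdf_gap += p.get(left, 0) - q.get(left, 0)  # CDF difference on [left, right)
        total += abs(cdf_gap) * (right - left)
    return total

def mu(n):
    """Two-point law with mass 1 - 1/n at 0 and mass 1/n at n."""
    return {0: 1 - Fraction(1, n), n: Fraction(1, n)}

point_mass_at_zero = {0: Fraction(1)}
```

With these definitions the TV distance from `mu(n)` to the point mass at 0 is 1/n, while the Wasserstein distance stays equal to 1 for every n.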

The following result will be very useful in Section 4. 

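Lemma 3. Consider a deterministic measurable function $g : A \times B \to C$. Let $W_1$ and
$W_2$ be two B-valued random variables and let U be an A-valued random variable that
is independent of both $W_i$'s. Define the C-valued random variables $X_1$ and $X_2$ by $X_i =
g(U, W_i)$, i = 1, 2. Then

$$d_{TV}(X_1,X_2) \le d_{TV}(W_1,W_2).$$

Proof. Choose a joint distribution $M(dw_1, dw_2)$ of a random vector $(\tilde W_1, \tilde W_2)$ on $B \times B$
such that $\tilde W_i$ has the same distribution as $W_i$ for i = 1, 2 and $M\{\tilde W_1 \ne \tilde W_2\} = d_{TV}(W_1,W_2)$. Also, make $(\tilde W_1, \tilde W_2)$
independent of U and let $\tilde X_i = g(U, \tilde W_i)$. Then $\tilde X_i$ has the same distribution as $X_i$ for i = 1, 2, so

$$d_{TV}(X_1,X_2) \le M\{\tilde X_1 \ne \tilde X_2\} \le M\{\tilde W_1 \ne \tilde W_2\} = d_{TV}(W_1,W_2). \qquad \Box$$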



3. Convergence in the Wasserstein metric 

3.1. Local contractivity condition and a convergence theorem 

Our main tool to obtain quantitative bounds for convergence in Wasserstein metric will 
be Steinsaltz's local contractivity convergence theorem [22]. Below, we review this result 
in a form convenient for us. 



Definition 4. An iterated function system is locally contractive if there exists a function
$\phi : \mathcal{X} \to [1,\infty)$ and $r \in (0,1)$ such that

$$G_n(x) := E[D_x F_n] \le \phi(x) r^n \quad\text{for all } n \ge 1,$$

where $D_x f := \limsup_{y\to x} \rho(f(x),f(y))/\rho(x,y)$. If this holds, then $\phi$ is called a drift function.

Theorem 5. If an iterated function system is locally contractive with a drift function $\phi$
and if

$$C_x := E\Big[\rho(f(x),x) \sup_{0 \le t \le 1}\{\phi(x + t(f(x)-x))\}\Big] < \infty,$$

then the system is attractive (in particular, the limit $\tilde F_\infty(x)$ of the backward iterates
$\tilde F_n(x) = f_1 \circ \cdots \circ f_n(x)$ is independent of x) and

$$d_W(F_n(x), \tilde F_\infty(x)) \le E\rho(\tilde F_n(x), \tilde F_\infty(x)) \le \frac{C_x r^n}{1-r} \quad\text{for every } x \in \mathcal{X}.$$

Steinsaltz [22] also gives a sufficient condition, called the growth condition, for a function
to be a drift function: a continuous function $\phi : \mathcal{X} \to [1,\infty)$ is a drift function if
r < 1, where

$$r := \sup_x E\Big[\frac{\phi(f(x))\,D_x f}{\phi(x)}\Big].$$

Here is a short argument (different from the original proof in [22]) to explain why. Let $\mathcal{L}$
be the positive linear operator which maps a generic function g to the function $\mathcal{L}(g)(x) =
E[g(f(x))D_x f]$. Then $G_n(x) = \mathcal{L}^n(1)(x)$, with 1 here being the constant function equal
to 1. Note that the growth condition is equivalent to $\mathcal{L}\phi \le r\phi$. We will refer to any
$\phi > 0$ satisfying $\mathcal{L}\phi \le r\phi$ as an r-sub-eigenfunction for $\mathcal{L}$. Now, if $\phi \ge 1$ and $\phi$ is an
r-sub-eigenfunction, then $G_n(x) = \mathcal{L}^n 1 \le \mathcal{L}^n\phi \le r^n\phi$ and hence $\phi$ is a drift function with
rate r.

We note that Proposition 8 of [23] shows that the existence of a $\phi$ satisfying the growth
condition is also necessary for local contractivity.
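The quantity $G_n(x) = E[D_x F_n]$ can be computed exactly in a toy case. The following sketch is ours and is built on a hypothetical system that does not appear in the paper: affine maps on the real line, f_i(x) = A_i x + B_i, whose slope A_i equals 1/2 or 6/5 with probability 1/2 each. For affine maps the local Lipschitz constant of a composition is the product of the slopes, so G_n does not depend on x or on the shifts B_i, and the constant function 1 satisfies the growth condition with r = E[A_i] = 17/20 even though one of the maps expands distances.

```python
from fractions import Fraction
from itertools import product

# Hypothetical slope distribution (an assumption for illustration):
# each value is taken with probability 1/2.
SLOPES = (Fraction(1, 2), Fraction(6, 5))

def expected_local_lipschitz(n):
    """G_n = E[D_x F_n] for F_n = f_n o ... o f_1 with f_i(x) = A_i x + B_i.

    Enumerates all 2**n equally likely slope sequences, so keep n small.
    """
    total = Fraction(0)
    for slopes in product(SLOPES, repeat=n):
        deriv = Fraction(1)
        for a in slopes:
            deriv *= a            # derivative of the composition
        total += deriv
    return total / len(SLOPES) ** n

# Growth rate for the constant drift function phi = 1: r = E[A_i].
r = sum(SLOPES) / len(SLOPES)
```

In this example `expected_local_lipschitz(n)` equals `r ** n`, which is the bound $G_n \le \phi\, r^n$ with equality.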



3.2. How to apply the local contractivity convergence theorem: 
Finding a drift function 

Applying Steinsaltz's local contractivity convergence theorem to a specific problem would 
be easy if one knew how to write down a drift function. Here, we will propose two practical 
strategies that can help us to do this. 

The first strategy is to find a linear operator $\hat{\mathcal{L}}$ that dominates $\mathcal{L}$ and is simpler to
manage. If $\phi$ is an r-sub-eigenfunction for $\hat{\mathcal{L}}$, then it is an r-sub-eigenfunction for $\mathcal{L}$ as
well.

One kind of operator that we can manage is defined as follows: let $\{A_1, \ldots, A_n\}$ be a finite
partition of the state space $\mathcal{X}$ and let

$$\hat{\mathcal{L}}\phi(x) = b(x)\sum_{i=1}^n 1_{A_i}(x)\int_{\mathcal{X}} \phi(s)\,\mu_i(ds), \tag{17}$$

where b(x) is a positive function and each $\mu_i$ is a non-zero finite measure on $\mathcal{X}$.

Theorem 6. Let $\hat{\mathcal{L}}$ be an operator of the form (17). In order for $\hat{\mathcal{L}}$ to have an
r-sub-eigenfunction, it is necessary and sufficient that the matrix

$$Q(i,j) = \int_{A_j} b(x)\,\mu_i(dx)$$

has an r-sub-eigenvector $p = (p_1, p_2, \ldots, p_n)'$; that is, $p_i > 0$ for all i and $Qp \le rp$. Moreover,
if p is an r-sub-eigenvector for Q, then the function

$$\phi(x) = b(x)\sum_{i=1}^n 1_{A_i}(x)\,p_i \tag{18}$$

(and any positive multiple of it) is an r-sub-eigenfunction for $\hat{\mathcal{L}}$.

Proof. If $\phi$ is an r-sub-eigenfunction of $\hat{\mathcal{L}}$, then $b(x)\sum_{j=1}^n 1_{A_j}(x)\int \phi(c)\,\mu_j(dc) \le r\phi(x)$,
by definition of $\hat{\mathcal{L}}$. Integrating both sides with respect to $\mu_i$ gives

$$\sum_{j=1}^n \int_{A_j} b(x)\,\mu_i(dx) \int \phi(c)\,\mu_j(dc) \le r\int \phi(x)\,\mu_i(dx).$$

Therefore, the vector p defined by $p_i := \int \phi\,d\mu_i$ is an r-sub-eigenvector for Q. Conversely,
if p is an r-sub-eigenvector for Q and if $\phi$ is as defined in (18), then

$$\hat{\mathcal{L}}\phi(x) = b(x)\sum_{i=1}^n 1_{A_i}(x)\sum_{j=1}^n p_j\int_{A_j} b(s)\,\mu_i(ds) \le b(x)\sum_{i=1}^n 1_{A_i}(x)\,r p_i = r\phi(x).$$

Hence $\phi$ is an r-sub-eigenfunction and so is any positive multiple of it. □

For the case n = 1, Theorem 6 implies the following.

Corollary 7. Assume that b is a positive function and $\mu$ is a finite measure such that
$\mathcal{L}\phi(x) \le b(x)\int \phi(s)\,\mu(ds)$ for every $x \in \mathcal{X}$ and every positive $\phi$. Let $r = \int b(s)\,\mu(ds)$. Then
b is an r-sub-eigenfunction for $\mathcal{L}$.
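Theorem 6 reduces the search for a sub-eigenfunction to a finite-dimensional check: find a positive vector p with $Qp \le rp$. The sketch below is ours: the function name and the 2 x 2 matrix are hypothetical and serve only to show how the best rate attached to a candidate vector p can be read off as the largest ratio $(Qp)_i / p_i$.

```python
from fractions import Fraction

def sub_eigen_rate(Q, p):
    """Smallest r with Q p <= r p componentwise, for p with positive entries."""
    if any(pi <= 0 for pi in p):
        raise ValueError("p must have strictly positive entries")
    Qp = [sum(qij * pj for qij, pj in zip(row, p)) for row in Q]
    return max(qp_i / p_i for qp_i, p_i in zip(Qp, p))

# Hypothetical matrix Q(i, j); in an application its entries would be the
# integrals of b over A_j with respect to mu_i.
Q = [[Fraction(1, 5), Fraction(1, 10)],
     [Fraction(3, 10), Fraction(2, 5)]]
```

Different candidate vectors give different rates: here `sub_eigen_rate(Q, [1, 1])` is 7/10, while `sub_eigen_rate(Q, [1, 2])` is 11/20, so the second vector certifies a faster rate. The infimum over all positive p is the spectral radius of Q when Q is a positive matrix, which is a standard Perron-Frobenius fact rather than something stated in the text.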

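Note that for an r-sub-eigenfunction to be a drift function, it must be at least 1
everywhere. If $\phi$ is bounded away from 0, we can get a drift function simply by scaling $\phi$. However,
if $\phi$ is not bounded away from 0, we first need to truncate it, as in the following lemma.

Lemma 8. Let $\phi$ be an r-sub-eigenfunction for $\mathcal{L}$. Let $\varepsilon > 0$ and define

$$\phi_\varepsilon(x) = \frac{1}{\varepsilon}\max\{\phi(x), \varepsilon\}. \tag{19}$$

Define $A_0 := \sup_x E[D_x f/\phi(x)]$ and $r_\varepsilon := r + \varepsilon A_0$, and assume that $A_0 < \infty$. Then $\phi_\varepsilon$ is an
$r_\varepsilon$-sub-eigenfunction for $\mathcal{L}$.

Proof. Since $\phi_\varepsilon(x) \ge 1$ for every x and $\phi_\varepsilon(f(x))/\phi_\varepsilon(x) \le (\phi(f(x)) + \varepsilon)/\phi(x)$, we have

$$E\Big[\frac{\phi_\varepsilon(f(x))\,D_x f}{\phi_\varepsilon(x)}\Big] \le E\Big[\frac{\phi(f(x))\,D_x f}{\phi(x)}\Big] + \varepsilon E\Big[\frac{D_x f}{\phi(x)}\Big] \le r + \varepsilon A_0. \tag{20}$$
□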
The second strategy is to switch to an easier operator, analogously to switching from
one measure to another by the use of a Radon-Nikodym derivative. 

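Lemma 9. Assume that a positive linear operator $\mathcal{L}_1$ has the integral representation
$\mathcal{L}_1(\phi)(x) = \int \phi(y)K(x,dy)$ and let $\mathcal{L}_2(\phi)(x) = \frac{1}{h(x)}\int \phi(y)h(y)K(x,dy)$, where h is a
strictly positive function. Then $\phi$ is an r-sub-eigenfunction for $\mathcal{L}_1$ if and only if $\phi/h$ is
an r-sub-eigenfunction for $\mathcal{L}_2$.

Proof. It is enough to prove one direction only. Let $\phi$ be an r-sub-eigenfunction for $\mathcal{L}_1$.
Then

$$\mathcal{L}_2(\phi/h)(x) = \frac{1}{h(x)}\int \frac{\phi(y)}{h(y)}\,h(y)K(x,dy) = \frac{1}{h(x)}\mathcal{L}_1(\phi)(x) \le r\,\frac{\phi(x)}{h(x)}. \qquad \Box$$

In particular, this lemma tells us that if $r := \sup_x K(x,\mathcal{X}) < 1$, then 1/h is an
r-sub-eigenfunction for $\mathcal{L}_2$.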



3.3. Example 1: Normal Gibbs sampler 

We shall use the techniques of Section 3.2 to find drift functions for the Gibbs sampler 
example of Section 1. Recall that, without loss of generality, we assume that K = 0 or 1
and $\xi = 0$. The following proposition gives three different drift functions that are valid
under different conditions on the parameters and the data Y. It should be clear that other
drift functions are possible; also, the bounds $r_i$ can be tightened somewhat at the cost
of additional effort and/or more complicated expressions. For numerical illustrations, see
the remarks following the proof of Proposition 11.

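Proposition 10. (i) For given $K \ge 0$, let

$$A := \frac{(\alpha+J/2)(|\bar Y|\sqrt{K}+1)(|\bar Y|\sqrt{K}+1/2)}{\Sigma_0^2} \quad\text{and}\quad r_1 := \frac{(|\bar Y|\sqrt{K}+1)(|\bar Y|\sqrt{K}+1/2)}{\alpha+J/2-1}.$$

If $r_1 < 1$, then for any $\varepsilon$ such that $r_{1,\varepsilon} := r_1 + \varepsilon A < 1$, $\phi_{1,\varepsilon}(x) := \frac{1}{\varepsilon}\max(\varepsilon, x^{-2})$ is a drift
function with rate $r_{1,\varepsilon}$.

(ii) Assume K = 1. Let $r_2 := (\alpha+J/2)\frac{J^2}{\Sigma_0^2}(|\bar Y|+1)(|\bar Y|+\tfrac12)$. If $r_2 < 1$, then $\phi_2(x) = 1$
is a drift function with rate $r_2$.

(iii) Assume K = 1. Define

$$A := \frac{(|\bar Y|+1)(\alpha+J/2)J\sqrt{2\pi}}{2\Sigma_0^2}, \qquad b(x) := \frac{J}{\sqrt{2\pi}}\Big(\frac{2|\bar Y|}{(xJ+1)^{3/2}} + \frac{1}{xJ+1}\Big) \tag{21}$$

and

$$r_3 := \frac{1}{\sqrt{2\pi}}\Big(4|\bar Y|\Big(1 - \frac{1}{\sqrt{J(\alpha+J/2)/\Sigma_0+1}}\Big) + \log\Big(\frac{J(\alpha+J/2)}{\Sigma_0}+1\Big)\Big).$$

If $r_3 < 1$, then for any $\varepsilon$ such that $r_{3,\varepsilon} := r_3 + \varepsilon A < 1$, the function $\phi_{3,\varepsilon}(x) =
\frac{1}{\varepsilon}\max(\varepsilon, b(x))$ is a drift function with rate $r_{3,\varepsilon}$.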
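The three rates of Proposition 10 are closed-form expressions and can be evaluated directly. The sketch below is ours and is hedged in two ways. First, the formulas are our reading of the proposition: $r_1$ and its constant A from part (i), $r_2$ from part (ii), $r_3$ from part (iii), with `ybar` standing for $\bar Y$ and `sigma0` for $\Sigma_0$. Second, the function names are hypothetical, and the parameter triples in the usage note are values we picked because they reproduce the numbers quoted in the remarks after Proposition 11; they are not taken from Table 1.

```python
import math

def rate_r1(alpha, J, K, ybar):
    """r_1 of part (i): (m + 1)(m + 1/2) / (alpha + J/2 - 1), m = |Ybar| sqrt(K)."""
    m = abs(ybar) * math.sqrt(K)
    return (m + 1.0) * (m + 0.5) / (alpha + J / 2 - 1.0)

def const_A1(alpha, J, K, ybar, sigma0):
    """Constant A of part (i), used in the truncated rate r_1 + eps * A."""
    m = abs(ybar) * math.sqrt(K)
    return (alpha + J / 2) * (m + 1.0) * (m + 0.5) / sigma0 ** 2

def rate_r2(alpha, J, ybar, sigma0):
    """r_2 of part (ii), for K = 1."""
    return (alpha + J / 2) * (J / sigma0) ** 2 * (abs(ybar) + 1.0) * (abs(ybar) + 0.5)

def rate_r3(alpha, J, ybar, sigma0):
    """r_3 of part (iii), for K = 1."""
    t = J * (alpha + J / 2) / sigma0 + 1.0
    inner = 4.0 * abs(ybar) * (1.0 - 1.0 / math.sqrt(t)) + math.log(t)
    return inner / math.sqrt(2.0 * math.pi)
```

With $\alpha = 1$: the triple (J, ybar, sigma0) = (10, 1.5, 60) gives `rate_r2` equal to 5/6; (5, 0.5, 5) gives `rate_r1` equal to 0.6 and `const_A1` equal to 0.21; and (5, 1.0, 12) gives `rate_r3` just below 0.9369. For K = 0 the first rate reduces to $1/(2\alpha + J - 2)$.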

Proof. The idea of the proof is that for each case, we find a sub-eigenfunction $\phi$ for the
operator $\mathcal{L}$ and, if necessary, we truncate $\phi$, as in Lemma 8, to obtain a drift function.
Recall $\mathcal{L}(\phi)(x) = E[\phi(f(x))D_x f]$, where

$$f(x) = \frac{G}{\Sigma_0 + (J/2)\big(\bar YK/(xJ+K) - Z/\sqrt{xJ+K}\big)^2}$$

and G and Z are two independent random variables with $\Gamma(\alpha+J/2, 1)$ and $N(0,1)$
distributions, respectively. We shall frequently use (without reference) the following two
easy calculations for G and Z. First, the definition of the Gamma distribution implies
that

$$E(G^p) = \frac{\Gamma(\alpha+J/2+p)}{\Gamma(\alpha+J/2)} \quad\text{for } p > -(\alpha+J/2). \tag{22}$$

Second, for all constants a, b, c, d, the Schwarz inequality and $E(Z^2) = 1$ imply

$$E(|a+bZ||c+dZ|) \le \sqrt{a^2+b^2}\sqrt{c^2+d^2} \le (|a|+|b|)(|c|+|d|). \tag{23}$$

(i) The local Lipschitz constant $D_x f$ is equal to the absolute value of the derivative of f
at x, so, by direct computation,

$$D_x f = \frac{GJ^2\,\big|\bar YK/(xJ+K) - Z/\sqrt{xJ+K}\big|\,\big|\bar YK/(xJ+K)^2 - Z/(2(xJ+K)^{3/2})\big|}{\big(\Sigma_0 + (J/2)(\bar YK/(xJ+K) - Z/\sqrt{xJ+K})^2\big)^2}. \tag{24}$$

Let $\kappa_x$ be the joint distribution of f(x) and $\tilde D_x$, where

$$\tilde D_x = \frac{J^2\,\big|\bar YK/(xJ+K) - Z/\sqrt{xJ+K}\big|\,\big|\bar YK/(xJ+K)^2 - Z/(2(xJ+K)^{3/2})\big|}{G}$$

and let $K_x(dc) = x^2\int_y y\,\kappa_x(dc,dy)$. Note that $f(x)^2\tilde D_x = D_x f$. Therefore,

$$\mathcal{L}\phi(x) = E[\phi(f(x))D_x f] = E[\phi(f(x))f(x)^2\tilde D_x] = \frac{1}{x^2}\int \phi(c)h(c)K_x(dc),$$

where $h(c) = c^2$. Let $\mathcal{L}_1$ be the operator defined by $\mathcal{L}_1\phi(x) := \int \phi(c)K_x(dc)$ and let
$\mathcal{L}_2 = \mathcal{L}$. By Lemma 9, we see that if $\phi$ is an r-sub-eigenfunction for $\mathcal{L}_1$ then $\phi/h$ is an
r-sub-eigenfunction for $\mathcal{L}_2 = \mathcal{L}$. We find that

$$\sup_x \int_0^\infty K_x(dc) = \sup_x x^2E[\tilde D_x] \le r_1.$$

If $r_1 < 1$, then $\phi(x) = 1$ is an $r_1$-sub-eigenfunction for $\mathcal{L}_1$ and hence $\phi_1(x) = x^{-2}$ is an
$r_1$-sub-eigenfunction for $\mathcal{L}$. Finally, note that for every x > 0,

$$E\Big[\frac{D_x f}{\phi_1(x)}\Big] = E\Big[\frac{(xJ)^2}{(xJ+K)^2}\cdot\frac{G\,\big|\bar YK/\sqrt{xJ+K} - Z\big|\,\big|\bar YK/\sqrt{xJ+K} - Z/2\big|}{\big(\Sigma_0 + (J/2)(\bar YK/(xJ+K) - Z/\sqrt{xJ+K})^2\big)^2}\Big] \le A.$$

Hence, by Lemma 8, $\phi_{1,\varepsilon}$ is a drift function with growth rate less than $r_1 + \varepsilon A$.

(ii) When K = 1, $\sup_x E(D_x f) \le r_2$. If $r_2 < 1$ and we let $\phi_2(x) = 1$ for all x, then $\mathcal{L}\phi_2(x) =
E(D_x f) \le r_2\phi_2(x)$ and thus $\phi_2(x)$ is a drift function with rate $r_2$.

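(iii) We first derive a more explicit formula for $\mathcal{L}$ and then look for an operator $\hat{\mathcal{L}}$ of
the form (17) with n = 1 that dominates $\mathcal{L}$ (as in Corollary 7). Note that we can write

$$\mathcal{L}\phi(x) = \int_0^\infty \phi(c)\Big(\int_{-\infty}^{\infty} \Lambda_x(z,c)\,h_{Z,f(x)}(z,c)\,dz\Big)\,dc,$$

where $h_{Z,f(x)}$ is the joint density of (Z, f(x)) and

$$\Lambda_x(z,c) := \frac{cJ^2\,\big|\bar Y/(xJ+1) - z/\sqrt{xJ+1}\big|\,\big|\bar Y/(xJ+1)^2 - z/(2(xJ+1)^{3/2})\big|}{\Sigma_0 + (J/2)\big(\bar Y/(xJ+1) - z/\sqrt{xJ+1}\big)^2}$$

(observe that $\Lambda_x(Z, f(x)) = D_x f$, by (24)). To simplify the formulae, let us put

$$A_x(z) := \frac{\bar Y}{xJ+1} - \frac{z}{\sqrt{xJ+1}}, \qquad B_x(z) := \frac{\bar Y}{(xJ+1)^2} - \frac{z}{2(xJ+1)^{3/2}}$$

and $u_x(z) := \Sigma_0 + \frac{J}{2}A_x(z)^2$.

To find $h_{Z,f(x)}$, we consider the mapping $T_x(z,g) = (z, g/u_x(z))$. Note that $T_x(Z,G) =
(Z, f(x))$. $T_x$ is one-to-one and $T_x^{-1}(z,c) = (z, c\,u_x(z))$. Let D be the Jacobian of
$T_x^{-1}$. We have $h_{Z,f(x)}(z,c) = h_{Z,G}(T_x^{-1}(z,c))\,|\det D|$ and $|\det D| = u_x(z)$; therefore,

$$h_{Z,f(x)}(z,c) = \frac{1}{\Gamma(\alpha+J/2)\sqrt{2\pi}}\,u_x(z)\,e^{-z^2/2}\,(c\,u_x(z))^{\alpha+J/2-1}e^{-c\,u_x(z)}.$$

Now,

$$\int_{-\infty}^{\infty} \Lambda_x(z,c)\,h_{Z,f(x)}(z,c)\,dz = \frac{cJ^2}{\Gamma(\alpha+J/2)\sqrt{2\pi}}\Big(\int_{z<\bar Y/\sqrt{xJ+1}} + \int_{z>\bar Y/\sqrt{xJ+1}}\Big)\,e^{-z^2/2}\,|A_x(z)|\,|B_x(z)|\,(c\,u_x(z))^{\alpha+J/2-1}e^{-c\,u_x(z)}\,dz.$$

Substituting $u = c\,u_x(z)$ and noting that $du = -\frac{cJ}{\sqrt{xJ+1}}A_x(z)\,dz$, we get

$$\int_{-\infty}^{\infty} \Lambda_x(z,c)\,h_{Z,f(x)}(z,c)\,dz = \frac{J}{\Gamma(\alpha+J/2)\,2\sqrt{2\pi}\sqrt{xJ+1}}\int_{c\Sigma_0}^{\infty} u^{\alpha+J/2-1}e^{-u}\Big[e^{-\frac12(xJ+1)\big(\frac{\bar Y}{xJ+1}+\sqrt{\frac2J(\frac uc-\Sigma_0)}\big)^2}\Big|\frac{\bar Y}{xJ+1}-\sqrt{\tfrac2J\big(\tfrac uc-\Sigma_0\big)}\Big| + e^{-\frac12(xJ+1)\big(\frac{\bar Y}{xJ+1}-\sqrt{\frac2J(\frac uc-\Sigma_0)}\big)^2}\Big|\frac{\bar Y}{xJ+1}+\sqrt{\tfrac2J\big(\tfrac uc-\Sigma_0\big)}\Big|\Big]\,du.$$

Using the inequality $|t|\,e^{-\frac{C}{2}(A+t)^2} \le |A| + 1/\sqrt{C}$ (where A and t are real and C > 0),
we bound the term inside the brackets by $2\big(\frac{2|\bar Y|}{xJ+1} + \frac{1}{\sqrt{xJ+1}}\big)$. Hence, $\mathcal{L}(\phi)(x) \le
b(x)\int_0^\infty \phi(c)H(c\Sigma_0)\,dc$, where b(x) is defined in (21) and H is one minus the c.d.f. of
our gamma variable G, that is, $H(x) = \Pr\{G > x\}$.

Next, we compute $r = \int_0^\infty b(c)H(c\Sigma_0)\,dc$. Let g be the density of G. Note

$$\int_0^\infty \frac{1}{(cJ+1)^{3/2}}H(c\Sigma_0)\,dc = \frac2J\int_0^\infty\Big(1 - \frac{1}{\sqrt{xJ/\Sigma_0+1}}\Big)g(x)\,dx$$
$$\le \frac2J\Big(1 - \frac{1}{\sqrt{\int_0^\infty (xJ/\Sigma_0+1)\,g(x)\,dx}}\Big) \tag{25}$$
$$= \frac2J\Big(1 - \frac{1}{\sqrt{J(\alpha+J/2)/\Sigma_0+1}}\Big) \tag{26}$$

and

$$\int_0^\infty \frac{1}{cJ+1}H(c\Sigma_0)\,dc = \frac1J\int_0^\infty \log\Big(\frac{xJ}{\Sigma_0}+1\Big)g(x)\,dx$$
$$\le \frac1J\log\Big(\int_0^\infty\Big(\frac{xJ}{\Sigma_0}+1\Big)g(x)\,dx\Big) \tag{27}$$
$$= \frac1J\log\Big(\frac{J(\alpha+J/2)}{\Sigma_0}+1\Big), \tag{28}$$

where (25) and (27) follow from Jensen's inequality. Therefore, $r \le r_3$. We conclude that
$\phi_3 = b$ is an $r_3$-sub-eigenfunction.

Using (23), we have

$$E[D_x f] \le \frac{(\alpha+J/2)J^2}{\Sigma_0^2}\,E\Big[\Big|\frac{\bar Y}{xJ+1} - \frac{Z}{\sqrt{xJ+1}}\Big|\,\Big|\frac{\bar Y}{(xJ+1)^2} - \frac{Z}{2(xJ+1)^{3/2}}\Big|\Big]$$
$$\le \frac{(\alpha+J/2)J^2}{\Sigma_0^2}\Big(\frac{|\bar Y|}{xJ+1} + \frac{1}{\sqrt{xJ+1}}\Big)\Big(\frac{|\bar Y|}{(xJ+1)^2} + \frac{1}{2(xJ+1)^{3/2}}\Big)$$
$$= \frac{(\alpha+J/2)J\sqrt{2\pi}}{2\Sigma_0^2(xJ+1)}\Big(\frac{|\bar Y|}{\sqrt{xJ+1}} + 1\Big)b(x),$$

where b is defined in equation (21). Hence, $\sup_x E[D_x f/b(x)] \le A$. By Corollary 7 and
Lemma 8, the function $\phi_{3,\varepsilon}$ is a drift function with growth rate less than $r_{3,\varepsilon}$. □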

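Proposition 11. Define $r_i$ and $r_{i,\varepsilon}$ as in Proposition 10.

(i) Let $K \ge 0$ and assume that $\alpha + J/2 > 2$. If $r_{1,\varepsilon} < 1$, then for all x > 0 and all
$n \ge 1$,

$$d_W(P_K^n(x,\cdot),\pi_K) \le \frac{C'_{1,\varepsilon,x}}{1-r_{1,\varepsilon}}\,r_{1,\varepsilon}^n,$$

where

$$C'_{1,\varepsilon,x} := \Big(x + \frac{\alpha+J/2}{\Sigma_0}\Big)\Big(\max\Big\{1, \frac{1}{\varepsilon x^2}\Big\} + \frac{\Sigma_0^2 + J\Sigma_0\Big(\frac{(\bar YK)^2}{(xJ+K)^2} + \frac{1}{xJ+K}\Big) + \frac{J^2}{4}\Big(\frac{(\bar YK)^4}{(xJ+K)^4} + \frac{6(\bar YK)^2}{(xJ+K)^3} + \frac{3}{(xJ+K)^2}\Big)}{\varepsilon(\alpha+J/2-1)(\alpha+J/2-2)}\Big).$$

(ii) Assume K = 1. If $r_2 < 1$, then for all x > 0 and all $n \ge 1$,

$$d_W(P_1^n(x,\cdot),\pi_1) \le \frac{x + (\alpha+J/2)/\Sigma_0}{1-r_2}\,r_2^n.$$

(iii) Assume K = 1. If $r_{3,\varepsilon} < 1$, then for all x > 0 and all $n \ge 1$,

$$d_W(P_1^n(x,\cdot),\pi_1) \le \frac{C'_{3,\varepsilon,x}}{1-r_{3,\varepsilon}}\,r_{3,\varepsilon}^n,$$

where

$$C'_{3,\varepsilon,x} := \max\Big\{1, \frac{J(2|\bar Y|+1)}{\varepsilon\sqrt{2\pi}}\Big\}\Big(x + \frac{\alpha+J/2}{\Sigma_0}\Big).$$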


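Proof. (i) If $r_{1,\varepsilon} < 1$, then $d_W(P_K^n(x,\cdot),\pi_K) \le \frac{C_{1,\varepsilon,x}}{1-r_{1,\varepsilon}}\,r_{1,\varepsilon}^n$ by Theorem 5, where

$$C_{1,\varepsilon,x} := E\Big[|f(x)-x|\sup_{t\in[0,1]}\{\phi_{1,\varepsilon}(x + t(f(x)-x))\}\Big]$$
$$\le E\Big[(f(x)+x)\max\Big\{1, \frac{1}{\varepsilon x^2}, \frac{1}{\varepsilon f(x)^2}\Big\}\Big]$$
$$\le E[f(x)+x]\;E\Big[\max\Big\{1, \frac{1}{\varepsilon x^2}, \frac{1}{\varepsilon f(x)^2}\Big\}\Big],$$

the last line following from the FKG inequality (see, for example, Theorem 3.17 of [15])
since $1/f(x)^2$ is a decreasing function of the random variable f(x). From

$$x + E(f(x)) \le x + \frac{\alpha+J/2}{\Sigma_0} \tag{29}$$

and (using equation (22) with $p = -2 > -(\alpha+J/2)$)

$$E(f(x)^{-2}) = E\Big[\Big(\Sigma_0 + \frac J2\Big(\frac{\bar YK}{xJ+K} - \frac{Z}{\sqrt{xJ+K}}\Big)^2\Big)^2\Big]\,E(G^{-2})$$
$$= E\Big[\Sigma_0^2 + J\Sigma_0\Big(\frac{\bar YK}{xJ+K} - \frac{Z}{\sqrt{xJ+K}}\Big)^2 + \frac{J^2}{4}\Big(\frac{\bar YK}{xJ+K} - \frac{Z}{\sqrt{xJ+K}}\Big)^4\Big]\Big/\big((\alpha+J/2-1)(\alpha+J/2-2)\big),$$

and calculation of the expectations in the brackets in the above expression, we find that
$C'_{1,\varepsilon,x}$ is an upper bound for $C_{1,\varepsilon,x}$.

(ii) If $r_2 < 1$, then $\phi(x) = 1$ is a drift function with rate $r_2$. Hence, Theorem 5 implies
that $d_W(P_1^n(x,\cdot),\pi_1) \le \frac{C_{2,x}}{1-r_2}\,r_2^n$, and $C_{2,x} = E(|f(x)-x|) \le x + \frac{\alpha+J/2}{\Sigma_0}$ by equation (29).

(iii) If $r_{3,\varepsilon} < 1$, then $d_W(P_1^n(x,\cdot),\pi_1) \le \frac{C_{3,\varepsilon,x}}{1-r_{3,\varepsilon}}\,r_{3,\varepsilon}^n$ and $C_{3,\varepsilon,x} \le E[f(x)+x]\,\sup_y(\phi_{3,\varepsilon}(y)) \le
C'_{3,\varepsilon,x}$ because of (29) and the fact that $\sup_y(\phi_{3,\varepsilon}(y)) = \max\{1, \frac{J(2|\bar Y|+1)}{\varepsilon\sqrt{2\pi}}\}$. □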



Remarks. (1) The criterion $r_2 < 1$ is essentially the condition that $\log \sup_x E(D_x f) < 0$. This is similar to the strong contractivity condition which says that $E(\log \sup_x D_x f) < 0$. Logically, neither condition implies the other. Each implies the weaker condition $\sup_{x,y} E(\log[\rho(f(x),f(y))/\rho(x,y)]) < 0$ used in [1] to prove attractivity (in a more restrictive setting).

(2) In the Bayesian model, as the number of observations $J$ increases, $Y$ and $S_0/J$ both converge (to $\theta$ and $\sigma^2$, respectively). Therefore, for large $J$, we expect $r_1$ to be small, but $r_2$ and $r_3$ to be large.

(3) ($K = 1$) To illustrate the calculations in the preceding propositions, we considered some cases with $5 \le J \le 10$, $a = 1$, $0.5 \le Y \le 1.5$ and $5 \le S_0 \le 60$. As shown in Table 1, it is possible for any one of $r_1$, $r_2$ or $r_3$ to be less than the other two.

(a) In case A, we have $r_2 = 5/6$ and $C_{2,x} = x + 0.1$. Hence, for $x = 1$, we have

$$d_W(P_1^n(1,\cdot),\pi_1) \le 6.6\,(5/6)^n \quad\text{for } n \ge 1 \text{ in case A.}$$

In particular, $d_W(P_1^n(1,\cdot),\pi_1) < 0.01$ for $n \ge 36$ in case A.

(b) For case B, we have $r_1 = 0.6$ and $A = 0.21$. We want to have $r_{1,\epsilon} < 1$, where $r_{1,\epsilon} = 0.6 + 0.21\epsilon$. Suppose we choose $\epsilon = 0.5$. Then $r_{1,\epsilon} = 0.705$ and $\hat{C}_{1,\epsilon,x} \le (16 + \max\{1, 2x^{-2}\})(x + 0.7)$ for all $x > 0$. For $x = 1$, we obtain

$$d_W(P_1^n(1,\cdot),\pi_1) \le 104\,(0.705)^n \quad\text{for } n \ge 1 \text{ in case B.}$$

In particular, $d_W(P_1^n(1,\cdot),\pi_1) < 0.01$ for $n \ge 27$ in case B.

(c) In case C, we have $r_3 < 0.9369$ and $A < 0.305$. Choosing $\epsilon = 0.01$ gives $r_{3,\epsilon} < 0.94$ and $\hat{C}_{3,\epsilon,x} < 599(x + 0.3)$. For $x = 1$, we obtain

$$d_W(P_1^n(1,\cdot),\pi_1) \le 12980\,(0.94)^n \quad\text{for } n \ge 1 \text{ in case C.}$$

Therefore, $d_W(P_1^n(1,\cdot),\pi_1) < 0.01$ for $n \ge 228$ in case C.

(4) ($K = 0$) Consider the three cases of Table 1, but now using the prior distribution with $K = 0$. Table 2 gives the calculations of Propositions 10(i) and 11(i) (note that $r_1 = 1/[2a + J - 2]$); the last column is the bound on the Wasserstein distance from equilibrium after $n$ iterations, started from $x = 1$. We find that $d_W(P_0^n(1,\cdot),\pi_0) < 0.01$ for $n \ge 5$ in case A and for $n \ge 6$ in cases B and C.
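
The iteration counts quoted in Remarks (3) and (4) can be reproduced directly from the stated geometric bounds. The following Python sketch is not part of the original analysis; the helper name `first_n_below` and its tolerance argument are illustrative assumptions. It searches for the smallest $n$ at which a bound of the form $Q S^n$ drops below 0.01.

```python
def first_n_below(q, s, tol=0.01):
    """Smallest n >= 1 with q * s**n < tol, for a geometric bound q * s**n (0 < s < 1)."""
    n = 1
    while q * s**n >= tol:
        n += 1
    return n

# Wasserstein bounds (Q, S) from Remark (3), K = 1, started at x = 1.
wasserstein_k1 = {"A": (6.6, 5 / 6), "B": (104, 0.705), "C": (12980, 0.94)}
# Wasserstein bounds (Q, S) from Table 2 / Remark (4), K = 0, started at x = 1.
wasserstein_k0 = {"A": (226, 0.101), "B": (40.9, 0.235), "C": (70.3, 0.213)}

n_k1 = {case: first_n_below(q, s) for case, (q, s) in wasserstein_k1.items()}
n_k0 = {case: first_n_below(q, s) for case, (q, s) in wasserstein_k0.items()}
print(n_k1)  # {'A': 36, 'B': 27, 'C': 228}
print(n_k0)  # {'A': 5, 'B': 6, 'C': 6}
```

A plain loop is used instead of a closed-form logarithm so that the strict inequality is tested exactly as stated in the remarks.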

4. From Wasserstein distance to total variation distance

4.1. One-shot coupling 

In this section, we present Theorem 12, our main tool for converting Wasserstein convergence rates to total variation convergence rates. Various methods of coupling have been used for proving convergence in TV distance [5, 13, 19]. Although not explicit in the final formulation, the idea behind this theorem is a certain kind of coupling method, called one-shot coupling, which has been successfully applied to iterated function systems by Roberts and Rosenthal [18] (see also [2, 12]). We describe this method now.

Table 1. Values of $r_1$, $r_2$ and $r_3$ in three cases of the Normal Gibbs sampler with $K = 1$. Observe that $r_2$ is best in case A, $r_1$ in case B and $r_3$ in case C. Numbers with "..." have had trailing digits truncated; other numbers are exact

Case   J    a   Y     S_0   r_1   r_2       r_3
A      10   1   1.5   60    1     5/6       0.97...
B      5    1   0.5   5     0.6   5.25      1.02...
C      5    1   1     12    1.2   1.82...   0.9368...

Table 2. Values of expressions from Propositions 10(i) and 11(i) for the Normal Gibbs sampler with $K = 0$, for the cases given in Table 1. The values of $\epsilon$ were chosen somewhat arbitrarily. We use $x = 1$ in all cases. Numbers with "..." have had trailing digits truncated; other numbers are exact

Case   r_1   A          ε     r_{1,ε}     Ĉ_{1,ε,x}   d_W(P_0^n(1,·), π_0) ≤
A      0.1   1/1200     1     0.1008...   202.4...    226 (0.101)^n
B      0.2   0.07       0.5   0.235       31.28...    40.9 (0.235)^n
C      0.2   0.012...   1     0.212...    55.28...    70.3 (0.213)^n

We shall consider two copies of a Markov chain, running simultaneously. Let $S_0$ and $\tilde{S}_0$ be two initial values for this chain (possibly random with some joint distribution). Let $\{f_t\}$ be a sequence of i.i.d. random maps that defines this Markov chain. Define

$$S_t = f_t(S_{t-1}) \quad\text{and}\quad \tilde{S}_t = f_t(\tilde{S}_{t-1}) \quad\text{for } t = 1, \ldots, n-1.$$

That is, we use the same realization of the functions $f_t$ on both copies of the chains, up to time $n - 1$. Suppose, at time $n$, we can find two copies $f_n$ and $\tilde{f}_n$ of $f_n$, that are independent from everything earlier (but not independent of each other), such that, with high probability, we have $f_n(S_{n-1}) = \tilde{f}_n(\tilde{S}_{n-1})$. (The name "one-shot coupling" refers to the fact that we only try to coalesce the two copies of the chain at the single time $n$.) By the representation (13), this would imply that $S_n$ and $\tilde{S}_n$ are close to each other in TV distance. Two conditions help us to find such $f_n$ and $\tilde{f}_n$: first, $S_{n-1}$ and $\tilde{S}_{n-1}$ need to be reasonably close; second, the density functions of the two random variables $f_t(x)$ and $f_t(y)$ need to have a large overlap when $x$ and $y$ are close. Theorem 12 is a precise refinement of this argument.
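
The two phases of one-shot coupling can be illustrated on a toy chain that is not one of the examples of this paper. The sketch below assumes the autoregressive maps $f_t(s) = r s + Z_t$ with i.i.d. standard Normal $Z_t$; the function name, the value $r = 0.5$ and the seed are hypothetical choices made only for illustration. Both copies share the same maps up to time $n-1$, and the final step uses the exact TV distance between $N(rS_{n-1},1)$ and $N(r\tilde{S}_{n-1},1)$, which is the probability that a maximal coupling of the last step fails to coalesce.

```python
import math
import random

def one_shot_coupling_bound(x0, y0, n, r=0.5, seed=0):
    """Run two copies of s -> r*s + Z with shared noise for n-1 steps,
    then return (gap at time n-1, failure probability of a maximal
    coupling at the single time n)."""
    rng = random.Random(seed)
    s, s_tilde = x0, y0
    for _ in range(n - 1):
        z = rng.gauss(0.0, 1.0)      # same realization of f_t for both copies
        s = r * s + z
        s_tilde = r * s_tilde + z
    gap = abs(s - s_tilde)
    # d_TV(N(m1, 1), N(m2, 1)) = erf(|m1 - m2| / (2*sqrt(2))), with |m1 - m2| = r * gap.
    fail_prob = math.erf(r * gap / (2.0 * math.sqrt(2.0)))
    return gap, fail_prob

gap, fail_prob = one_shot_coupling_bound(3.0, -1.0, n=11)
print(gap, fail_prob)
```

For this toy chain the gap after the shared phase is exactly $r^{n-1}|x_0 - y_0|$ (up to rounding), so the first condition in the text (closeness at time $n-1$) is visible directly, and the second condition (overlap of the one-step densities) enters through the `erf` term.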

In what follows, let $(\mathcal{X}, \rho)$ be a complete separable metric space and let $P$ be a transition probability operator on the state space $\mathcal{X}$. Assume that $P$ has a density $p$ with respect to some reference measure $\lambda$ (that is, $P(x,dz) = p(x,z)\lambda(dz)$). Let $\mu$ be any probability distribution on $\mathcal{X}$ and let $\pi$ be a stationary probability distribution for $P$.

Theorem 12. (a) Assume that there is a constant $A$ such that

$$\int_{\mathcal{X}} |p(x,z) - p(y,z)|\,\lambda(dz) \le A\rho(x,y) \quad\text{for all } x,y \in \mathcal{X}. \tag{30}$$

Then

$$d_{TV}(\mu P^n, \pi) \le \frac{A}{2}\, d_W(\mu P^{n-1}, \pi) \quad\text{for all } n \ge 1.$$

(b) Assume the following conditions hold:

(i) there exists a function $h \ge 0$ on $\mathcal{X}$ such that

$$\int_{\mathcal{X}} |p(x,z) - p(y,z)|\,\lambda(dz) \le \frac{\rho(x,y)}{\max\{h(x),h(y)\}} \quad\text{for all } x,y \in \mathcal{X}; \tag{31}$$

(ii) there exist positive constants $B$, $q$ and $\epsilon_0$ such that

$$\pi(\{y : h(y) \le \epsilon\}) \le B\epsilon^q \quad\text{for all } \epsilon \text{ in } (0,\epsilon_0). \tag{32}$$

Let $C = (2q)^{-q/(1+q)} \max\{(q+1)B^{1/(1+q)}, (B\epsilon_0^{1+q})^{-q/(1+q)}\}$. Then

$$d_{TV}(\mu P^n, \pi) \le C\,[d_W(\mu P^{n-1}, \pi)]^{q/(1+q)} \quad\text{for all } n \ge 1. \tag{33}$$

Remarks. (1) If we also know $\limsup_{n\to\infty}[d_W(\mu P^n,\pi)]^{1/n} \le \beta < 1$, then the conditions of Theorem 12(b) imply that $\limsup_{n\to\infty}[d_{TV}(\mu P^n,\pi)]^{1/n} \le \beta^{q/(1+q)}$.

(2) Observe that condition (30) should not be expected to hold uniformly for $x$ and $y$ near 0 in the random logistic model. Indeed, as $x$ decreases to 0, the density of $f_t(x)$ becomes more and more peaked near 0. Essentially, this is because 0 is a fixed point of the continuous random function $f_t$. The same thing happens in the Gibbs sampler example when $K$ is 0.

(3) Lemma 3 will be useful in obtaining bounds of the form (30) or (31).

Our first step in proving the above theorem is the following calculation.

Lemma 13. Let $\eta$ and $\nu$ be probability measures on $\mathcal{X}$. Let $\Psi$ be a probability measure in Joint$(\eta,\nu)$. Then

$$d_{TV}(\eta P, \nu P) \le \frac{1}{2}\int\!\!\int\!\!\int |p(x,z) - p(y,z)|\,\lambda(dz)\,\Psi(dx,dy). \tag{34}$$

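Proof. Since $(\eta P)(dz) = (\int_{\mathcal{X}} \eta(dx)p(x,z))\lambda(dz)$ and similarly for $\nu P$, we apply equation (14) to obtain

$$d_{TV}(\eta P, \nu P) = \frac{1}{2}\int \left| \int \eta(dx)p(x,z) - \int \nu(dy)p(y,z) \right| \lambda(dz)$$
$$= \frac{1}{2}\int \left| \int\!\!\int p(x,z)\,\Psi(dx,dy) - \int\!\!\int p(y,z)\,\Psi(dx,dy) \right| \lambda(dz)$$
$$\le \frac{1}{2}\int\!\!\int\!\!\int |p(x,z) - p(y,z)|\,\Psi(dx,dy)\,\lambda(dz)$$
$$= \frac{1}{2}\int\!\!\int\!\!\int |p(x,z) - p(y,z)|\,\lambda(dz)\,\Psi(dx,dy). \qquad □$$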

Proof of Theorem 12. We shall apply Lemma 13 with $\eta = \mu P^{n-1}$ and $\nu = \pi$ ($= \pi P$). Recall from Section 2 that there is a probability measure $\Psi = \Psi_{\eta,\nu}$ in Joint$(\eta,\nu)$ such that $d_W(\eta,\nu) = \int_{\mathcal{X}}\int_{\mathcal{X}} \rho(x,y)\,\Psi(dx,dy)$. The proof of part (a) follows immediately.

For part (b), let $\epsilon > 0$. Observe that the left-hand side of equation (31) is never greater than 2. Lemma 13 and the assumption (31) then imply that

$$d_{TV}(\eta P, \nu P) \le I_A + I_B, \tag{35}$$

where

$$I_A = \frac{1}{2}\int\!\!\int_{\{x,y:\, \max\{h(x),h(y)\} > \epsilon\}} \frac{\rho(x,y)}{\max\{h(x),h(y)\}}\,\Psi(dx,dy)$$

and

$$I_B = \int\!\!\int_{\{x,y:\, \max\{h(x),h(y)\} \le \epsilon\}} \Psi(dx,dy).$$

Note that

$$I_A \le \frac{1}{2}\int\!\!\int_{\{x,y:\, \max\{h(x),h(y)\} > \epsilon\}} \frac{\rho(x,y)}{\epsilon}\,\Psi(dx,dy) \le \frac{d_W(\eta,\nu)}{2\epsilon}$$

and $I_B \le \pi(\{y : h(y) \le \epsilon\})$. Combining these bounds with the assumption (32) tells us that

$$d_{TV}(\mu P^n, \pi) \le \frac{d_W(\mu P^{n-1},\pi)}{2\epsilon} + B\epsilon^q.$$

Let $\Delta_n = d_W(\mu P^{n-1},\pi)$ and consider the function $G_n(\epsilon) = \Delta_n/(2\epsilon) + B\epsilon^q$. Simple calculus shows that $G_n$ is minimized at $\epsilon_n := (\Delta_n/(2qB))^{1/(1+q)}$ and the minimum value of the function is $G_n(\epsilon_n) = C_{B,q}\Delta_n^{q/(1+q)}$, where $C_{B,q} = (q+1)(Bq^{-q}2^{-q})^{1/(1+q)}$. Let $a_0 = 2Bq\epsilon_0^{1+q}$. If $\Delta_n \le a_0$, then $\epsilon_n \le \epsilon_0$, so $d_{TV}(\mu P^n,\pi) \le G_n(\epsilon_n)$. If $\Delta_n > a_0$, then, trivially, $d_{TV}(\mu P^n,\pi) \le 1 < (\Delta_n/a_0)^{q/(1+q)}$. Thus equation (33) holds with $C = \max\{C_{B,q}, a_0^{-q/(1+q)}\}$. □
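
The "simple calculus" step in the proof above can be spelled out as follows. This is a worked derivation added for convenience, using only the quantities already defined in the proof: $\Delta_n = d_W(\mu P^{n-1},\pi)$ and $G_n(\epsilon) = \Delta_n/(2\epsilon) + B\epsilon^q$. Since $G_n(\epsilon) \to \infty$ both as $\epsilon \to 0$ and as $\epsilon \to \infty$, the unique critical point is the minimizer.

```latex
\begin{align*}
G_n'(\epsilon) &= -\frac{\Delta_n}{2\epsilon^{2}} + qB\epsilon^{q-1} = 0
  \iff \epsilon^{1+q} = \frac{\Delta_n}{2qB}, \\
G_n(\epsilon_n) &= \epsilon_n^{q}\left(\frac{\Delta_n}{2\epsilon_n^{1+q}} + B\right)
  = (q+1)\,B\,\epsilon_n^{q} \\
  &= (q+1)\,B^{1/(1+q)}\,(2q)^{-q/(1+q)}\,\Delta_n^{q/(1+q)} .
\end{align*}
```

The last expression is $C_{B,q}\Delta_n^{q/(1+q)}$ with $C_{B,q} = (q+1)(Bq^{-q}2^{-q})^{1/(1+q)}$, in agreement with the constant used in the proof.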



4.2. Example 1: Normal Gibbs sampler 

We return to the Gibbs sampler example described in Section 1. Recall that we write $P_K$, $p_K$ and $\pi_K$ to denote the corresponding transition kernel, density and stationary distribution, where $K \in \{0,1\}$, without loss of generality.

Proposition 14. Let $\mu$ be an arbitrary initial probability distribution on $(0,\infty)$. Then

$$d_{TV}(\mu P_1^n, \pi_1) \le \frac{J}{2}\left(1 + \frac{|Y|}{\sqrt{2\pi}}\right) d_W(\mu P_1^{n-1}, \pi_1) \quad\text{for } n = 1,2,\ldots \tag{36}$$

$$d_{TV}(\mu P_0^n, \pi_0) \le C\, d_W(\mu P_0^{n-1}, \pi_0)^{w} \quad\text{for } n = 1,2,\ldots, \tag{37}$$

where

$$w := \frac{2a+J-1}{2a+J+1}$$

and

$$C := \frac{2a+J+1}{2}\,(2a+J-1)^{-w}\exp\left(\frac{2S_0}{2a+J+1}\right).$$

Before proceeding, let us revisit the numerical examples of Table 1, as discussed in the remarks following Proposition 11.

(a) ($K = 1$) If $d_W(\mu P_1^n, \pi_1) \le QS^n$ for some constants $Q$ and $S$, then $d_{TV}(\mu P_1^n, \pi_1) \le \frac{J}{2}(1 + |Y|/\sqrt{2\pi})(Q/S)S^n$. Thus, for the case where $\mu$ is the point mass at $x = 1$, we obtain the following upper bounds on $d_{TV}(\mu P_1^n, \pi_1)$: $63.3(5/6)^n$ in case A, $443(0.705)^n$ in case B and $48{,}294(0.94)^n$ in case C. Hence, the total variation distance to equilibrium is less than 0.01 when $n \ge 49$ in case A, when $n \ge 31$ in case B and when $n \ge 249$ in case C.

(b) ($K = 0$) We have $w = 11/13$ in case A and $w = 3/4$ in cases B and C. Numerical values for $C$ (rounded up) are 8722 in case A, 3.642 in case B and 20.96 in case C. If we know that $d_W(\mu P_0^n, \pi_0) \le QS^n$, then we obtain $d_{TV}(\mu P_0^n, \pi_0) \le C(Q/S)^w (S^w)^n$. Thus, for the case where $\mu$ is the point mass at $x = 1$, we obtain the following upper bounds on $d_{TV}(\mu P_0^n, \pi_0)$: $5{,}958{,}000(0.144)^n$ in case A, $174.6(0.338)^n$ in case B and $1624(0.314)^n$ in case C. Therefore, $d_{TV}(P_0^n(1,\cdot),\pi_0) < 0.01$ for $n \ge 11$ in cases A and C, and for $n \ge 10$ in case B.
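
The constants in item (b) can be recomputed from the formula for $C$ in Theorem 12(b). The Python sketch below is an illustration only: the function names are hypothetical, and the inputs $q = a + (J-1)/2$, $B = e^{S_0}$ and $\epsilon_0 = 1$ are assumptions inferred from Lemma 18 and the proof of Proposition 14, not values confirmed independently. The tuples $(Q, S)$ are the Wasserstein bounds of Table 2.

```python
import math

def tv_constant(q, b, eps0):
    """Constant C of Theorem 12(b) for given q, B and eps_0."""
    w = q / (1.0 + q)
    first = (q + 1.0) * b ** (1.0 / (1.0 + q))
    second = (b * eps0 ** (1.0 + q)) ** (-w)
    return (2.0 * q) ** (-w) * max(first, second)

def tv_bound_from_wasserstein(c, w, q_w, s_w):
    """Turn d_W(n) <= q_w * s_w**n into d_TV(n) <= q_tv * s_tv**n,
    using d_TV(n) <= c * d_W(n-1)**w."""
    return c * (q_w / s_w) ** w, s_w ** w

# (a, J, S_0) from Table 1 and Wasserstein bounds (Q, S) from Table 2.
cases = {"A": (1, 10, 60, 226, 0.101),
         "B": (1, 5, 5, 40.9, 0.235),
         "C": (1, 5, 12, 70.3, 0.213)}
results = {}
for name, (a, j, s0, q_w, s_w) in cases.items():
    q = a + (j - 1) / 2
    c = tv_constant(q, math.exp(s0), 1.0)  # assumption: B = exp(S_0), eps_0 = 1
    results[name] = (c, *tv_bound_from_wasserstein(c, q / (1 + q), q_w, s_w))
    print(name, results[name])
```

Each printed triple is $(C, C(Q/S)^w, S^w)$; under the stated assumptions these should sit just below the rounded-up values quoted in item (b).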

Logically, the proof of this proposition belongs at the end of this section since it relies 
on several lemmas that have not yet been proven. However, we shall present the proof 
now since it serves as a guide for what is to come. 

Proof of Proposition 14. Equation (36) follows from Theorem 12(a) and Lemma 16 below. Equation (37) follows from Theorem 12(b) and Lemmas 17 and 18 below. In Theorem 12(b), we use $q = a + (J-1)/2$, $B = e^{S_0}$ and $\epsilon_0 = 1$ (all courtesy of Lemma 18), and it is not hard to check that, in the definition of $C$, the first term inside the 'max' exceeds the second. □

The proof of Lemma 18 relies on our knowledge of the explicit form of the equilibrium 
distribution (which is known in many MCMC problems). The proofs of Lemmas 16 and 
17 rely heavily on Lemma 3, together with the following technical lemma. 

Lemma 15. Let $Z$ be a standard Normal random variable.

(a) Let $a$ and $b$ be positive constants. Then $d_{TV}(Z/\sqrt{a}, Z/\sqrt{b}) \le |a - b|/\max\{a,b\}$.

(b) Let $t$ be a real constant. Then $d_{TV}(Z, Z + t) \le |t|/\sqrt{2\pi}$.

Proof. For positive $x$, let $\phi_x(\cdot)$ be the probability density function of $Z/\sqrt{x}$, that is,

$$\phi_x(t) = \sqrt{\frac{x}{2\pi}}\, e^{-xt^2/2}.$$

(a) Without loss of generality, assume that $0 < a < b$. Using equation (15) and $e^{-at^2/2} \ge e^{-bt^2/2}$, we obtain

$$d_{TV}\left(\frac{Z}{\sqrt{a}}, \frac{Z}{\sqrt{b}}\right) \le \int_{-\infty}^{\infty} \left(\sqrt{\frac{b}{2\pi}} - \sqrt{\frac{a}{2\pi}}\right) e^{-bt^2/2}\,dt = \frac{\sqrt{b}-\sqrt{a}}{\sqrt{b}} \le \frac{b-a}{b}.$$

Since $b = \max\{a,b\}$, this proves part (a).

(b) Let $\phi$ be the probability density function of $Z$. Then $\phi(\cdot - t)$ is the probability density function of $Z + t$. By symmetry, we can assume that $t > 0$. Observe that the function $\min\{\phi(u), \phi(u-t)\}$ equals $\phi(u)$ for $u \ge t/2$ and is symmetric (with respect to $u$) about $u = t/2$. Using this observation with equation (16) shows that

$$d_{TV}(Z, Z+t) = 1 - \int_{-\infty}^{\infty} \min\{\phi(u), \phi(u-t)\}\,du = 1 - 2\int_{t/2}^{\infty} \phi(u)\,du = \int_{-t/2}^{t/2} \phi(u)\,du \le \frac{t}{\sqrt{2\pi}},$$

where we have used the bound $\phi(u) \le 1/\sqrt{2\pi}$ for all $u$. This proves part (b). □
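
Both bounds of Lemma 15 can be checked numerically against exact TV distances. The sketch below is an illustration, not part of the proof; the function names are hypothetical. It uses two standard closed forms: $d_{TV}(Z, Z+t) = 2\Phi(|t|/2) - 1$, and, for two centred Normal laws with precisions $a < b$, the fact that their densities cross at $|t| = t_*$ with $t_*^2 = \ln(b/a)/(b-a)$.

```python
import math

def tv_shift(t):
    """Exact d_TV(Z, Z + t) for standard Normal Z: 2*Phi(|t|/2) - 1."""
    return math.erf(abs(t) / (2.0 * math.sqrt(2.0)))

def tv_scale(a, b):
    """Exact d_TV(Z/sqrt(a), Z/sqrt(b)) for a, b > 0."""
    if a == b:
        return 0.0
    lo, hi = min(a, b), max(a, b)
    # The two centred Normal densities cross at |t| = t_star.
    t_star = math.sqrt(math.log(hi / lo) / (hi - lo))
    return math.erf(t_star * math.sqrt(hi / 2.0)) - math.erf(t_star * math.sqrt(lo / 2.0))

grid = [0.1 * k for k in range(1, 51)]
ok_shift = all(tv_shift(t) <= abs(t) / math.sqrt(2.0 * math.pi) for t in grid)
ok_scale = all(tv_scale(a, b) <= abs(a - b) / max(a, b) + 1e-12
               for a in grid for b in grid)
print(ok_shift, ok_scale)  # True True
```

The check covers only a finite grid, so it is a sanity test of the inequalities rather than a substitute for the proof.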

Lemma 16 ($K = 1$). For all positive $x$ and $y$,

$$d_{TV}(P_1(x,\cdot), P_1(y,\cdot)) \le J|x - y|\left(1 + \frac{|Y|}{\sqrt{2\pi}}\right).$$

Proof. For given $s > 0$, $p_1(s,\cdot)$ is the probability density function of (8) with $K = 1$ and
^ = 0. Therefore, Lemma 3 implies that 

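$$d_{TV}(P_1(x,\cdot), P_1(y,\cdot)) \le d_{TV}\left(\frac{Z}{\sqrt{a}} - \frac{Y}{a},\ \frac{Z}{\sqrt{b}} - \frac{Y}{b}\right),$$

where $a = xJ + 1$, $b = yJ + 1$ and $Z \sim N(0,1)$. We then have

$$d_{TV}(P_1(x,\cdot), P_1(y,\cdot)) \le d_{TV}\left(\frac{Z}{\sqrt{a}},\ \frac{Z}{\sqrt{b}} + Y\left(\frac{1}{a} - \frac{1}{b}\right)\right)$$
$$\le d_{TV}\left(\frac{Z}{\sqrt{a}}, \frac{Z}{\sqrt{b}}\right) + d_{TV}\left(\frac{Z}{\sqrt{b}},\ \frac{Z}{\sqrt{b}} + Y\,\frac{b-a}{ab}\right)$$
$$= d_{TV}\left(\frac{Z}{\sqrt{a}}, \frac{Z}{\sqrt{b}}\right) + d_{TV}\left(Z,\ Z + Y\,\frac{b-a}{a\sqrt{b}}\right)$$
$$\le \frac{|a-b|}{\max\{a,b\}} + \frac{|Y|\,|b-a|}{\sqrt{2\pi}\,a\sqrt{b}} \quad\text{(by Lemma 15).}$$

Finally, since $|a - b| = J|x - y|$ and $a, b \ge 1$, the lemma follows. □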

Lemma 17 ($K = 0$). For all positive $x$ and $y$,

$$d_{TV}(P_0(x,\cdot), P_0(y,\cdot)) = \frac{1}{2}\int_0^{\infty} |p_0(x,z) - p_0(y,z)|\,dz \le \frac{|x-y|}{\max\{x,y\}}.$$

Proof. The equality in the lemma comes from equation (14). Recall from equation (9) that $p_0(x,\cdot)$ is the probability density function of $G/(S_0 + Z^2/(2x))$, where $G$ has a particular Gamma distribution and $Z$ has the standard Normal distribution. Therefore, Lemma 3 implies that $d_{TV}(P_0(x,\cdot), P_0(y,\cdot)) \le d_{TV}(Z/\sqrt{x}, Z/\sqrt{y})$ and Lemma 15(a) completes the proof. □

Lemma 18 ($K = 0$). $\pi_0([0,\epsilon]) \le e^{S_0}\epsilon^{a+(J-1)/2}$ for all $\epsilon$ in $(0,1]$.

Proof. The density $\pi_0(s)$ is the integral over $\theta$ of the posterior density $p(\theta,s|Y)$, which is given by equation (2) with $K = 0$. Using equation (5), we see that

$$p(\theta,s|Y) = \frac{1}{\zeta}\, s^{a-1+J/2} \exp[-sJ(Y-\theta)^2/2]\, e^{-S_0 s} \quad\text{for } s > 0 \text{ and } \theta \in \mathbb{R},$$

where $\zeta = \zeta(a, J, S_0, Y)$ is the normalizing constant. Truncating the double integral that defines $\zeta$ shows that

$$\zeta \ge e^{-S_0} \int_0^1\!\!\int_{-\infty}^{\infty} s^{a-1+J/2} \exp[-sJ(Y-\theta)^2/2]\,d\theta\,ds.$$

Therefore, for $\epsilon$ in $(0,1]$,

$$\pi_0([0,\epsilon]) \le \frac{1}{\zeta}\int_0^{\epsilon}\!\!\int_{-\infty}^{\infty} s^{a-1+J/2} \exp[-sJ(Y-\theta)^2/2]\,d\theta\,ds \le e^{S_0}\, \frac{\int_0^{\epsilon} s^{a-3/2+J/2}\,ds}{\int_0^{1} s^{a-3/2+J/2}\,ds} = e^{S_0}\epsilon^{a+(J-1)/2},$$

using $\int_{-\infty}^{\infty} \exp[-sJ(Y-\theta)^2/2]\,d\theta = (2\pi/(Js))^{1/2}$ in the second inequality. □

Remark. Although we did not do it, one can compute $\zeta$ exactly when $K = 0$. In most practical MCMC applications, the normalizing constant is hard to evaluate or even estimate, which is one reason that people use MCMC instead of numerical analysis. In general, finding constants $B$ and $\epsilon_0$ for equation (32) can be hard. The above proof suggests one way to approach the challenge.



4.3. Example 2: Random logistic maps 

Recall that we are considering i.i.d. random maps $f_1, f_2, \ldots$ on $[0,1]$ defined by

$$f_i(x) = 4B_i x(1-x),$$

where $B_i \sim \mathrm{Beta}(a + \tfrac{1}{2}, a - \tfrac{1}{2})$ [$a > \tfrac{1}{2}$], and that the Beta$(a,a)$ distribution is the unique stationary distribution for the iterated function system.

In this subsection, we prove Theorem 1. The proof of this theorem is similar to the proof of the '$K = 0$' part of Proposition 14.

We begin with some notation. Let $b(t)$ be the density of the $B_i$'s, that is,

$$b(t) = \begin{cases} \kappa_a t^{a-1/2}(1-t)^{a-3/2} & \text{for } 0 < t < 1, \\ 0 & \text{otherwise,} \end{cases}$$

where $\kappa_a = \Gamma(2a)/(\Gamma(a+\tfrac{1}{2})\Gamma(a-\tfrac{1}{2}))$. Let

$$Q(x) = 4x(1-x) \quad\text{for } 0 \le x \le 1.$$

Observe that $0 \le Q(x) \le 1$ for $0 \le x \le 1$. For a given $x \in (0,1)$, let $b_x(\cdot)$ be the probability density function of $B_1 Q(x)$, that is,

$$b_x(z) = \begin{cases} \kappa_a z^{a-1/2}(Q(x)-z)^{a-3/2}/Q(x)^{2a-1} & \text{for } 0 < z < Q(x), \\ 0 & \text{otherwise.} \end{cases}$$

Next, let $p(x,z)$ denote the transition density of the Markov chain corresponding to the iterated logistic maps. We then have

$$p(x,z) = b_x(z) \quad\text{for } x, z \in [0,1]. \tag{38}$$

Lemma 19. For the iterated logistic maps with $a > 1/2$, we have

$$\frac{1}{2}\int_0^1 |p(x,z) - p(y,z)|\,dz \le \frac{8a|x-y|}{\max\{Q(x),Q(y)\}} \quad\text{for } x,y \in (0,1).$$

Proof. Without loss of generality, assume that $0 < Q(x) \le Q(y)$. By equation (38), Proposition 2 and some calculation similar to that which was involved in the proof of Lemma 15, we have

$$\frac{1}{2}\int_0^1 |p(x,z) - p(y,z)|\,dz = \int_{\{z:\, b_x(z) > b_y(z)\}} (b_x(z) - b_y(z))\,dz$$
$$\le \left(1 - \left(\frac{Q(x)}{Q(y)}\right)^{2a-1}\right) \int_0^{Q(x)} \frac{\kappa_a z^{a-1/2}(Q(x)-z)^{a-3/2}}{Q(x)^{2a-1}}\,dz \quad\text{(since } Q(y) - z \ge Q(x) - z \ge 0\text{)}$$
$$= 1 - \left(\frac{Q(x)}{Q(y)}\right)^{2a-1}. \tag{39}$$

We now observe that for $p > 0$,

$$v^p - u^p \le \max\{p,1\}\, v^{p-1}|v-u| \quad\text{for } v \ge u \ge 0 \tag{40}$$

(for $0 < p < 1$, this is simple algebra and for $p \ge 1$, this follows from applying the mean value theorem to the function $t \mapsto t^p$). Next, since $|Q'(x)| = |4 - 8x| \le 4$, the mean value theorem implies that

$$|Q(y) - Q(x)| \le 4|y-x| \quad\text{for } x,y \in [0,1]. \tag{41}$$

Finally, for $0 < Q(x) \le Q(y)$, equations (39)-(41) imply that

$$\frac{1}{2}\int_0^1 |p(x,z) - p(y,z)|\,dz \le \frac{\max\{(2a-1),1\}\,|Q(y)-Q(x)|}{Q(y)} \le \frac{[(2a-1)+1]\,4|y-x|}{Q(y)} = \frac{8a|y-x|}{Q(y)}.$$

This proves the lemma. □



We can now apply Theorem 12(b) as follows. Let $\mu = \delta_x$ (point mass at $x$) and let $\pi_a$ be the equilibrium Beta$(a,a)$ distribution. Also, let $\lambda$ be Lebesgue measure and let the function $h(\cdot)$ be $Q(\cdot)/(16a)$. Lemma 19 then proves condition (i) of Theorem 12(b). For condition (ii), we need to estimate $\pi_a(\{y \in [0,1] : h(y) \le \epsilon\})$ for small positive $\epsilon$. Let $A = 16a$. Observe that if $Q(y)/A \le \epsilon$ and $0 \le y \le 1/2$, then $A\epsilon \ge 4y(1-y) \ge 4y(1/2)$, so $y \le A\epsilon/2$. Similarly, if $Q(y)/A \le \epsilon$ and $1/2 \le y \le 1$, then $y \ge 1 - A\epsilon/2$. Therefore, for $a \ge 1$ and $0 < \epsilon < 1/(16a)$, we have

[0,1] :%)<£}) 

= 7r,([0, Ae/2]) +7r„([l- Ae/2, 1]) 

= 27ra([0, Ae/2]) (since tTq is symmetric about 1/2) 

^Ka ^ t)''-^ dt (where A"a = r(2a)/r(a)2) (42) 

Jo 

.A6/2 

<Ka (43) 
Jo 

a 

Therefore, equation (32) holds with q^a, B = K'aS'^a"^^^ and eo = l/(16a). For 1/2 < a < 
1, everything is the same except that we use the bound (1 — t)"-^^ < 2^~° for < t < 1/2 
in the integrand of (42), obtaining an extra multiplicative factor of 2^~° in equation (43) 
and hence B = 2Ka'^°'a°'~^ ■ We have thus shown that Theorem 1 follows from Theorem 
12(b). 